Insurance Fraud Detection
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
#data visualization
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
# Importing sklearn libraries needed
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
#model selection
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV, KFold, StratifiedKFold, RandomizedSearchCV
from sklearn.preprocessing import MinMaxScaler, LabelEncoder, OneHotEncoder
#model evaluation
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, log_loss, fbeta_score
from sklearn.metrics import auc, roc_curve, roc_auc_score, precision_recall_curve, classification_report, confusion_matrix
#oversampling
from imblearn.over_sampling import SMOTE
#read the CSV file using pandas
data = pd.read_csv('Insurance_csv.csv')
#display top five rows of the data
data.head()
| months_as_customer | age | policy_number | policy_bind_date | policy_state | policy_csl | policy_deductable | policy_annual_premium | umbrella_limit | insured_zip | ... | witnesses | police_report_available | total_claim_amount | injury_claim | property_claim | vehicle_claim | auto_make | auto_model | auto_year | fraud_reported | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 328 | 48 | 521585 | 17-10-2014 | OH | 250/500 | 1000 | 1406.91 | 0 | 466132 | ... | 2 | YES | 71610 | 6510 | 13020 | 52080 | Saab | 92x | 2004 | Y |
| 1 | 228 | 42 | 342868 | 27-06-2006 | IN | 250/500 | 2000 | 1197.22 | 5000000 | 468176 | ... | 0 | ? | 5070 | 780 | 780 | 3510 | Mercedes | E400 | 2007 | Y |
| 2 | 134 | 29 | 687698 | 06-09-2000 | OH | 100/300 | 2000 | 1413.14 | 5000000 | 430632 | ... | 3 | NO | 34650 | 7700 | 3850 | 23100 | Dodge | RAM | 2007 | N |
| 3 | 256 | 41 | 227811 | 25-05-1990 | IL | 250/500 | 2000 | 1415.74 | 6000000 | 608117 | ... | 2 | NO | 63400 | 6340 | 6340 | 50720 | Chevrolet | Tahoe | 2014 | Y |
| 4 | 228 | 44 | 367455 | 06-06-2014 | IL | 500/1000 | 1000 | 1583.91 | 6000000 | 610706 | ... | 1 | NO | 6500 | 1300 | 650 | 4550 | Accura | RSX | 2009 | N |
5 rows × 39 columns
1. Finding Missing Values
2. Null Imputation
3. Outlier Analysis
#display number of rows and cols in the dataset
print("The number of records are : ", data.shape[0])
print("The number of features are : ",data.shape[1])
print("The list of features is : ", data.columns)
data.head()
The number of records are : 1000
The number of features are : 39
The list of features is : Index(['months_as_customer', 'age', 'policy_number', 'policy_bind_date',
'policy_state', 'policy_csl', 'policy_deductable',
'policy_annual_premium', 'umbrella_limit', 'insured_zip', 'insured_sex',
'insured_education_level', 'insured_occupation', 'insured_hobbies',
'insured_relationship', 'capital-gains', 'capital-loss',
'incident_date', 'incident_type', 'collision_type', 'incident_severity',
'authorities_contacted', 'incident_state', 'incident_city',
'incident_location', 'incident_hour_of_the_day',
'number_of_vehicles_involved', 'property_damage', 'bodily_injuries',
'witnesses', 'police_report_available', 'total_claim_amount',
'injury_claim', 'property_claim', 'vehicle_claim', 'auto_make',
'auto_model', 'auto_year', 'fraud_reported'],
dtype='object')
| months_as_customer | age | policy_number | policy_bind_date | policy_state | policy_csl | policy_deductable | policy_annual_premium | umbrella_limit | insured_zip | ... | witnesses | police_report_available | total_claim_amount | injury_claim | property_claim | vehicle_claim | auto_make | auto_model | auto_year | fraud_reported | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 328 | 48 | 521585 | 17-10-2014 | OH | 250/500 | 1000 | 1406.91 | 0 | 466132 | ... | 2 | YES | 71610 | 6510 | 13020 | 52080 | Saab | 92x | 2004 | Y |
| 1 | 228 | 42 | 342868 | 27-06-2006 | IN | 250/500 | 2000 | 1197.22 | 5000000 | 468176 | ... | 0 | ? | 5070 | 780 | 780 | 3510 | Mercedes | E400 | 2007 | Y |
| 2 | 134 | 29 | 687698 | 06-09-2000 | OH | 100/300 | 2000 | 1413.14 | 5000000 | 430632 | ... | 3 | NO | 34650 | 7700 | 3850 | 23100 | Dodge | RAM | 2007 | N |
| 3 | 256 | 41 | 227811 | 25-05-1990 | IL | 250/500 | 2000 | 1415.74 | 6000000 | 608117 | ... | 2 | NO | 63400 | 6340 | 6340 | 50720 | Chevrolet | Tahoe | 2014 | Y |
| 4 | 228 | 44 | 367455 | 06-06-2014 | IL | 500/1000 | 1000 | 1583.91 | 6000000 | 610706 | ... | 1 | NO | 6500 | 1300 | 650 | 4550 | Accura | RSX | 2009 | N |
5 rows × 39 columns
#checking null values in dataset
data.isnull().sum()
months_as_customer             0
age                            0
policy_number                  0
policy_bind_date               0
policy_state                   0
policy_csl                     0
policy_deductable              0
policy_annual_premium          0
umbrella_limit                 0
insured_zip                    0
insured_sex                    0
insured_education_level        0
insured_occupation             0
insured_hobbies                0
insured_relationship           0
capital-gains                  0
capital-loss                   0
incident_date                  0
incident_type                  0
collision_type                 0
incident_severity              0
authorities_contacted          0
incident_state                 0
incident_city                  0
incident_location              0
incident_hour_of_the_day       0
number_of_vehicles_involved    0
property_damage                0
bodily_injuries                0
witnesses                      0
police_report_available        0
total_claim_amount             0
injury_claim                   0
property_claim                 0
vehicle_claim                  0
auto_make                      0
auto_model                     0
auto_year                      0
fraud_reported                 0
dtype: int64
#checking for duplicated rows
data.duplicated().sum()
0
#checking unique values
data.nunique()
months_as_customer              391
age                              46
policy_number                  1000
policy_bind_date                951
policy_state                      3
policy_csl                        3
policy_deductable                 3
policy_annual_premium           991
umbrella_limit                   11
insured_zip                     995
insured_sex                       2
insured_education_level           7
insured_occupation               14
insured_hobbies                  20
insured_relationship              6
capital-gains                   338
capital-loss                    354
incident_date                    60
incident_type                     4
collision_type                    4
incident_severity                 4
authorities_contacted             5
incident_state                    7
incident_city                     7
incident_location              1000
incident_hour_of_the_day         24
number_of_vehicles_involved       4
property_damage                   3
bodily_injuries                   3
witnesses                         4
police_report_available           3
total_claim_amount              763
injury_claim                    638
property_claim                  626
vehicle_claim                   726
auto_make                        14
auto_model                       39
auto_year                        21
fraud_reported                    2
dtype: int64
#display summary information of the dataset
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 39 columns):
 #   Column                       Non-Null Count  Dtype
---  ------                       --------------  -----
 0   months_as_customer           1000 non-null   int64
 1   age                          1000 non-null   int64
 2   policy_number                1000 non-null   int64
 3   policy_bind_date             1000 non-null   object
 4   policy_state                 1000 non-null   object
 5   policy_csl                   1000 non-null   object
 6   policy_deductable            1000 non-null   int64
 7   policy_annual_premium        1000 non-null   float64
 8   umbrella_limit               1000 non-null   int64
 9   insured_zip                  1000 non-null   int64
 10  insured_sex                  1000 non-null   object
 11  insured_education_level      1000 non-null   object
 12  insured_occupation           1000 non-null   object
 13  insured_hobbies              1000 non-null   object
 14  insured_relationship         1000 non-null   object
 15  capital-gains                1000 non-null   int64
 16  capital-loss                 1000 non-null   int64
 17  incident_date                1000 non-null   object
 18  incident_type                1000 non-null   object
 19  collision_type               1000 non-null   object
 20  incident_severity            1000 non-null   object
 21  authorities_contacted        1000 non-null   object
 22  incident_state               1000 non-null   object
 23  incident_city                1000 non-null   object
 24  incident_location            1000 non-null   object
 25  incident_hour_of_the_day     1000 non-null   int64
 26  number_of_vehicles_involved  1000 non-null   int64
 27  property_damage              1000 non-null   object
 28  bodily_injuries              1000 non-null   int64
 29  witnesses                    1000 non-null   int64
 30  police_report_available      1000 non-null   object
 31  total_claim_amount           1000 non-null   int64
 32  injury_claim                 1000 non-null   int64
 33  property_claim               1000 non-null   int64
 34  vehicle_claim                1000 non-null   int64
 35  auto_make                    1000 non-null   object
 36  auto_model                   1000 non-null   object
 37  auto_year                    1000 non-null   int64
 38  fraud_reported               1000 non-null   object
dtypes: float64(1), int64(17), object(21)
memory usage: 304.8+ KB
#convert date-like columns to datetime dtype
data.auto_year=pd.to_datetime(data.auto_year,format="%Y")
data.policy_bind_date=pd.to_datetime(data.policy_bind_date)
data.incident_date=pd.to_datetime(data.incident_date)
data["auto_year_new"] = data["auto_year"].dt.year
data['incident_month']=data['incident_date'].dt.month # all incidents happened in 2015, so the year carries no information
data['policy_bind_year']=data['policy_bind_date'].dt.year
data['policy_bind_month']=data['policy_bind_date'].dt.month
# Remove the following columns:
# insured_zip, policy_number, incident_location -- their values are (nearly) unique per record,
# so they identify rows rather than carry predictive signal
# policy_bind_date, incident_date, auto_year -- already converted into year/month features above
data=data.drop(['insured_zip', 'auto_year','policy_number' , 'policy_bind_date','incident_location','incident_date'],axis=1)
# collecting index of numerical columns
Int=data.dtypes[data.dtypes=='int64'].index
Float=data.dtypes[data.dtypes=='float64'].index
num_index=Int.append(Float)
# collecting index of categorical columns
categ_index=data.dtypes[data.dtypes=='object'].index
# collecting index of datetime columns
date_index=data.dtypes[data.dtypes=='datetime64[ns]'].index
# descriptive statistics summary of numerical data
numdata=data.select_dtypes(exclude=['object'])
print(numdata.shape)
numdata.head(3)
(1000, 19)
| months_as_customer | age | policy_deductable | policy_annual_premium | umbrella_limit | capital-gains | capital-loss | incident_hour_of_the_day | number_of_vehicles_involved | bodily_injuries | witnesses | total_claim_amount | injury_claim | property_claim | vehicle_claim | auto_year_new | incident_month | policy_bind_year | policy_bind_month | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 328 | 48 | 1000 | 1406.91 | 0 | 53300 | 0 | 5 | 1 | 1 | 2 | 71610 | 6510 | 13020 | 52080 | 2004 | 1 | 2014 | 10 |
| 1 | 228 | 42 | 2000 | 1197.22 | 5000000 | 0 | 0 | 8 | 1 | 0 | 0 | 5070 | 780 | 780 | 3510 | 2007 | 1 | 2006 | 6 |
| 2 | 134 | 29 | 2000 | 1413.14 | 5000000 | 35100 | 0 | 7 | 3 | 2 | 3 | 34650 | 7700 | 3850 | 23100 | 2007 | 2 | 2000 | 6 |
# Categorical records in the dataset
cate=data.select_dtypes(include=['object'])
print(cate.shape)
cate.head(3)
(1000, 18)
| policy_state | policy_csl | insured_sex | insured_education_level | insured_occupation | insured_hobbies | insured_relationship | incident_type | collision_type | incident_severity | authorities_contacted | incident_state | incident_city | property_damage | police_report_available | auto_make | auto_model | fraud_reported | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | OH | 250/500 | MALE | MD | craft-repair | sleeping | husband | Single Vehicle Collision | Side Collision | Major Damage | Police | SC | Columbus | YES | YES | Saab | 92x | Y |
| 1 | IN | 250/500 | MALE | MD | machine-op-inspct | reading | other-relative | Vehicle Theft | ? | Minor Damage | Police | VA | Riverwood | ? | ? | Mercedes | E400 | Y |
| 2 | OH | 100/300 | FEMALE | PhD | sales | board-games | own-child | Multi-vehicle Collision | Rear Collision | Minor Damage | Police | NY | Columbus | NO | NO | Dodge | RAM | N |
#descriptive statistics summary of categorical data
cate.describe()
| policy_state | policy_csl | insured_sex | insured_education_level | insured_occupation | insured_hobbies | insured_relationship | incident_type | collision_type | incident_severity | authorities_contacted | incident_state | incident_city | property_damage | police_report_available | auto_make | auto_model | fraud_reported | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 |
| unique | 3 | 3 | 2 | 7 | 14 | 20 | 6 | 4 | 4 | 4 | 5 | 7 | 7 | 3 | 3 | 14 | 39 | 2 |
| top | OH | 250/500 | FEMALE | JD | machine-op-inspct | reading | own-child | Multi-vehicle Collision | Rear Collision | Minor Damage | Police | NY | Springfield | ? | NO | Saab | RAM | N |
| freq | 352 | 351 | 537 | 161 | 93 | 64 | 183 | 419 | 292 | 354 | 292 | 262 | 157 | 360 | 343 | 80 | 43 | 753 |
# shape after dropping those columns
data.shape
(1000, 37)
# There appear to be "?" placeholders in some features, so we identify the columns that contain them before imputing.
missing_data = []
for col in data.columns:
if '?' in data[col].values:
missing_data.append(col)
missing_data
['collision_type', 'property_damage', 'police_report_available']
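The loop above can also be written as a single vectorized comparison. A minimal sketch on a toy frame (the column values here are illustrative, not the actual dataset):

```python
import pandas as pd

# Toy frame mimicking the "?" placeholders found above
toy = pd.DataFrame({
    "collision_type": ["Side Collision", "?", "Rear Collision"],
    "witnesses": [2, 0, 3],
    "property_damage": ["YES", "NO", "?"],
})

# (toy == "?") builds a boolean frame; .any() reduces it per column
missing_cols = toy.columns[(toy == "?").any()].tolist()
print(missing_cols)  # → ['collision_type', 'property_damage']
```

This avoids the explicit loop and scales to any number of columns; numeric columns simply compare as all-False.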
#checking value counts of collision_type
from collections import Counter
Counter(data['collision_type'])
Counter({'Side Collision': 276,
'?': 178,
'Rear Collision': 292,
'Front Collision': 254})
#checking value counts of property_damage
Counter(data['property_damage'])
Counter({'YES': 302, '?': 360, 'NO': 338})
#checking value counts of police_report_available
Counter(data['police_report_available'])
Counter({'YES': 314, '?': 343, 'NO': 343})
#Replacing "?" with an explicit "Unknown" category
data["collision_type"] = np.where(data["collision_type"] == "?", "Unknown", data["collision_type"])
data["property_damage"] = np.where(data["property_damage"] == "?", "Unknown", data["property_damage"])
data["police_report_available"] = np.where(data["police_report_available"] == "?", "Unknown", data["police_report_available"])
1. Univariate Analysis
2. Bivariate Analysis
3. Multivariate Analysis
1. Univariate Analysis
# Checking Distribution of numerical columns
fig,ax=plt.subplots(7,3,figsize=(20,15))
i,j=0,0
for col in numdata:
sns.distplot(data[col],ax=ax[i,j])
j+=1
if j==3:
i+=1
j=0
fig.tight_layout()
# Checking feature "months_as_customer" and "age"
sns.set(style="darkgrid")
plot_objects = plt.subplots(nrows=1, ncols=2, figsize=(20, 5))
fig, (ax1, ax2) = plot_objects
sns.distplot(data["months_as_customer"], bins=50 , ax=ax1)
ax1.set_title("Months as Customer")
ax1.set_xlabel(" ")
sns.distplot(data["age"], bins=50 , ax=ax2)
ax2.set_title("Age")
ax2.set_xlabel(" ")
Text(0.5, 0, ' ')
data[["months_as_customer", "age"]].describe()
| months_as_customer | age | |
|---|---|---|
| count | 1000.000000 | 1000.000000 |
| mean | 203.954000 | 38.948000 |
| std | 115.113174 | 9.140287 |
| min | 0.000000 | 19.000000 |
| 25% | 115.750000 | 32.000000 |
| 50% | 199.500000 | 38.000000 |
| 75% | 276.250000 | 44.000000 |
| max | 479.000000 | 64.000000 |
import plotly
import plotly.offline as py
import plotly.graph_objs as go
fraud = cate['fraud_reported'].value_counts()
label_fraud = fraud.index
size_fraud = fraud.values
colors = ['silver', 'gold']
trace = go.Pie(
labels = label_fraud, values = size_fraud, marker = dict(colors = colors), name = 'Frauds', hole = 0.3)
df = [trace]
layout = go.Layout(
title = 'Distribution of Frauds')
fig = go.Figure(data = df, layout = layout)
py.iplot(fig)
fraud = cate['insured_sex'].value_counts()
label_fraud = fraud.index
size_fraud = fraud.values
colors = ['silver', 'pink']
trace = go.Pie(
labels = label_fraud, values = size_fraud, marker = dict(colors = colors), name = 'gender', hole = 0.3)
df = [trace]
layout = go.Layout(
title = 'Distribution of gender')
fig = go.Figure(data = df, layout = layout)
py.iplot(fig)
fig, axes = plt.subplots(figsize=(23, 4))
sns.countplot("insured_occupation", data=data)
<AxesSubplot:xlabel='insured_occupation', ylabel='count'>
# Plotting countplots for a few categorical features
sns.set(style="darkgrid")
fig, axes = plt.subplots(1, 3, figsize=(20, 4), sharey=True)
sns.countplot("policy_state", data=data, ax=axes[0])
sns.countplot("policy_csl", data=data, ax=axes[1])
sns.countplot("policy_deductable", data=data, ax=axes[2])
<AxesSubplot:xlabel='policy_deductable', ylabel='count'>
# Checking feature "policy_annual_premium"
#sns.set(style="darkgrid")
plot_objects = plt.subplots(nrows=1, ncols=1, figsize=(20, 5))
fig, ax1 = plot_objects
sns.distplot(data["policy_annual_premium"], bins=50 , ax=ax1)
ax1.set_title("Policy Annual Premium")
ax1.set_xlabel(" ")
print("The basic statistics for the feature is :\n",data["policy_annual_premium"].describe())
print("The number of unique values in the feature is :", data["policy_annual_premium"].nunique())
The basic statistics for the feature is :
count    1000.000000
mean     1256.406150
std       244.167395
min       433.330000
25%      1089.607500
50%      1257.200000
75%      1415.695000
max      2047.590000
Name: policy_annual_premium, dtype: float64
The number of unique values in the feature is : 991
fig, axes = plt.subplots(figsize=(26, 4))
sns.countplot("insured_hobbies", data=data, ax=axes)
<AxesSubplot:xlabel='insured_hobbies', ylabel='count'>
# Plotting the countplot for the "insured_relationship" feature
fig, axes = plt.subplots(figsize=(12, 5))
sns.countplot("insured_relationship", data=data)
<AxesSubplot:xlabel='insured_relationship', ylabel='count'>
# Checking feature "capital-gains"
sns.set(style="darkgrid")
plot_objects = plt.subplots(nrows=1, ncols=1, figsize=(20, 5))
fig, ax1 = plot_objects
sns.distplot(data["capital-gains"], bins=50 , ax=ax1)
ax1.set_title("Capital Gains")
ax1.set_xlabel(" ")
print("The basic statistics for the feature is :\n", data["capital-gains"].describe())
print("The number of unique values in the feature is :",data["capital-gains"].nunique())
The basic statistics for the feature is :
count      1000.000000
mean      25126.100000
std       27872.187708
min           0.000000
25%           0.000000
50%           0.000000
75%       51025.000000
max      100500.000000
Name: capital-gains, dtype: float64
The number of unique values in the feature is : 338
# Checking feature "capital-loss"
sns.set(style="darkgrid")
plot_objects = plt.subplots(nrows=1, ncols=1, figsize=(20, 5))
fig, ax1 = plot_objects
sns.distplot(data["capital-loss"], bins=50 , ax=ax1)
ax1.set_title("Capital Loss")
ax1.set_xlabel(" ")
print("The basic statistics for the feature is :\n", data["capital-loss"].describe())
print("The number of unique values in the feature is :",data["capital-loss"].nunique())
The basic statistics for the feature is :
count      1000.000000
mean     -26793.700000
std       28104.096686
min     -111100.000000
25%      -51500.000000
50%      -23250.000000
75%           0.000000
max           0.000000
Name: capital-loss, dtype: float64
The number of unique values in the feature is : 354
# Plotting the countplot for the "incident_type" feature
fig, axes = plt.subplots(2, 2, figsize=(24, 6))
sns.countplot("incident_type", data=data, ax=axes[0,0],palette=['blue','green','red','lightgreen'])
sns.countplot("collision_type", data=data, ax=axes[0,1],palette=['blue','green','red','lightgreen'])
sns.countplot("incident_severity", data=data, ax=axes[1,0],palette=['blue','green','red','lightgreen'])
sns.countplot("authorities_contacted", data=data, ax=axes[1,1],palette=['blue','green','red','lightgreen'])
<AxesSubplot:xlabel='authorities_contacted', ylabel='count'>
sns.countplot("incident_month", data=data,palette=['blue','green','red'])
<AxesSubplot:xlabel='incident_month', ylabel='count'>
sns.countplot("police_report_available",data=data,palette=['blue','green','red'])
<AxesSubplot:xlabel='police_report_available', ylabel='count'>
# Plotting the countplot for the "incident_type" feature
sns.set(style="darkgrid")
fig, axes = plt.subplots(2, 1, figsize=(25, 10), sharey=True)
sns.countplot("incident_state", data=data, ax=axes[0],color='yellow')
sns.countplot("incident_city", data=data, ax=axes[1],color='red')
<AxesSubplot:xlabel='incident_city', ylabel='count'>
# Plotting the countplot for the "incident_type" feature
sns.set(style="darkgrid")
fig, axes = plt.subplots(1, 1, figsize=(25, 4), sharey=True)
sns.countplot("incident_hour_of_the_day", data=data, ax=axes)
<AxesSubplot:xlabel='incident_hour_of_the_day', ylabel='count'>
# Plotting the countplot for few categorical features
sns.set(style="darkgrid")
fig, axes = plt.subplots(2, 2, figsize=(25, 10), sharey=True)
sns.countplot("number_of_vehicles_involved", data=data, ax=axes[0,0],palette=['blue','green','red','lightgreen'])
sns.countplot("property_damage", data=data, ax=axes[0,1],palette=['blue','green','red','lightgreen'])
sns.countplot("bodily_injuries", data=data, ax=axes[1,0],palette=['blue','green','red','lightgreen'])
sns.countplot("witnesses", data=data, ax=axes[1,1],palette=['blue','green','red','lightgreen'])
<AxesSubplot:xlabel='witnesses', ylabel='count'>
# Checking feature "total_claim_amount"
sns.set(style="darkgrid")
plot_objects = plt.subplots(nrows=1, ncols=1, figsize=(20, 5))
fig, ax1 = plot_objects
sns.distplot(data["total_claim_amount"], bins=100 , ax=ax1,color='red')
ax1.set_title("Total Claim Amount")
ax1.set_xlabel(" ")
print("The basic statistics for the feature is :\n", data["total_claim_amount"].describe())
print("The number of unique values in the feature is :", data["total_claim_amount"].nunique())
The basic statistics for the feature is :
count      1000.00000
mean      52761.94000
std       26401.53319
min         100.00000
25%       41812.50000
50%       58055.00000
75%       70592.50000
max      114920.00000
Name: total_claim_amount, dtype: float64
The number of unique values in the feature is : 763
sns.set(style="darkgrid")
plot_objects = plt.subplots(nrows=1, ncols=1, figsize=(20, 5))
fig, ax1 = plot_objects
sns.distplot(data["injury_claim"], bins=100 , ax=ax1,color='red')
ax1.set_title("Injury Claim")
ax1.set_xlabel(" ")
print("The basic statistics for the feature is :\n", data["injury_claim"].describe())
print("The number of unique values in the feature is :", data["injury_claim"].nunique())
The basic statistics for the feature is :
count     1000.000000
mean      7433.420000
std       4880.951853
min          0.000000
25%       4295.000000
50%       6775.000000
75%      11305.000000
max      21450.000000
Name: injury_claim, dtype: float64
The number of unique values in the feature is : 638
# Checking feature "property_claim"
sns.set(style="darkgrid")
plot_objects = plt.subplots(nrows=1, ncols=1, figsize=(20, 5))
fig, ax1 = plot_objects
sns.distplot(data["property_claim"], bins=100 , ax=ax1,color='red')
ax1.set_title("Property Claim")
ax1.set_xlabel(" ")
print("The basic statistics for the feature is :\n", data["property_claim"].describe())
print("The number of unique values in the feature is :", data["property_claim"].nunique())
The basic statistics for the feature is :
count     1000.000000
mean      7399.570000
std       4824.726179
min          0.000000
25%       4445.000000
50%       6750.000000
75%      10885.000000
max      23670.000000
Name: property_claim, dtype: float64
The number of unique values in the feature is : 626
# Checking feature "vehicle_claim"
sns.set(style="darkgrid")
plot_objects = plt.subplots(nrows=1, ncols=1, figsize=(20, 5))
fig, ax1 = plot_objects
sns.distplot(data["vehicle_claim"], bins=100 , ax=ax1,color='red')
ax1.set_title("Vehicle Claim")
ax1.set_xlabel(" ")
print("The basic statistics for the feature is :\n", data["vehicle_claim"].describe())
print("The number of unique values in the feature is :", data["vehicle_claim"].nunique())
The basic statistics for the feature is :
count     1000.000000
mean     37928.950000
std      18886.252893
min         70.000000
25%      30292.500000
50%      42100.000000
75%      50822.500000
max      79560.000000
Name: vehicle_claim, dtype: float64
The number of unique values in the feature is : 726
data.head()
| months_as_customer | age | policy_state | policy_csl | policy_deductable | policy_annual_premium | umbrella_limit | insured_sex | insured_education_level | insured_occupation | ... | injury_claim | property_claim | vehicle_claim | auto_make | auto_model | fraud_reported | auto_year_new | incident_month | policy_bind_year | policy_bind_month | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 328 | 48 | OH | 250/500 | 1000 | 1406.91 | 0 | MALE | MD | craft-repair | ... | 6510 | 13020 | 52080 | Saab | 92x | Y | 2004 | 1 | 2014 | 10 |
| 1 | 228 | 42 | IN | 250/500 | 2000 | 1197.22 | 5000000 | MALE | MD | machine-op-inspct | ... | 780 | 780 | 3510 | Mercedes | E400 | Y | 2007 | 1 | 2006 | 6 |
| 2 | 134 | 29 | OH | 100/300 | 2000 | 1413.14 | 5000000 | FEMALE | PhD | sales | ... | 7700 | 3850 | 23100 | Dodge | RAM | N | 2007 | 2 | 2000 | 6 |
| 3 | 256 | 41 | IL | 250/500 | 2000 | 1415.74 | 6000000 | FEMALE | PhD | armed-forces | ... | 6340 | 6340 | 50720 | Chevrolet | Tahoe | Y | 2014 | 10 | 1990 | 5 |
| 4 | 228 | 44 | IL | 500/1000 | 1000 | 1583.91 | 6000000 | MALE | Associate | sales | ... | 1300 | 650 | 4550 | Accura | RSX | N | 2009 | 2 | 2014 | 6 |
5 rows × 37 columns
# Plotting the countplot for features
sns.set(style="darkgrid")
fig, axes = plt.subplots(3, 1, figsize=(25, 10), sharey=True)
sns.countplot("auto_make", data=data, ax=axes[0])
sns.countplot("auto_model", data=data, ax=axes[1])
sns.countplot("auto_year_new", data=data, ax=axes[2])
<AxesSubplot:xlabel='auto_year_new', ylabel='count'>
# Plotting the countplot for features
sns.set(style="darkgrid")
fig, axes = plt.subplots(1, 1, figsize=(26, 10), sharey=True)
sns.countplot("policy_bind_year", data=data, ax=axes)
<AxesSubplot:xlabel='policy_bind_year', ylabel='count'>
# Plotting the countplot for features
sns.set(style="darkgrid")
fig, axes = plt.subplots(1, 1, figsize=(25, 10), sharey=True)
sns.countplot("incident_month", data=data, ax=axes)
<AxesSubplot:xlabel='incident_month', ylabel='count'>
# heatmap of features with correlation magnitude of at least 0.3, including the DV
# (note: corr() silently drops incident_severity and fraud_reported, which are still object dtype here)
corlst=['age','months_as_customer','total_claim_amount', 'injury_claim', 'property_claim','vehicle_claim', 'incident_severity','fraud_reported']
corr_data = data[corlst]
corr=round(corr_data.corr(),2)
fix, ax = plt.subplots(figsize=(15,5))
ax = sns.heatmap(corr, ax=ax,annot=True)
plt.show()
# encoding the target: Y -> 1, N -> 0
fraud_map={'Y':1,'N':0}
data.fraud_reported =data.fraud_reported.map(fraud_map)
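With the target mapped to 0/1, the remaining object columns still need encoding before modeling. A minimal sketch using `pd.get_dummies` on a toy frame (the column choice here is illustrative):

```python
import pandas as pd

# Toy frame standing in for the remaining categorical columns
toy = pd.DataFrame({
    "policy_state": ["OH", "IN", "OH"],
    "fraud_reported": ["Y", "Y", "N"],
})

# Binary target: same map as above
toy["fraud_reported"] = toy["fraud_reported"].map({"Y": 1, "N": 0})

# One-hot encode the rest; drop_first avoids the dummy-variable trap
encoded = pd.get_dummies(toy, columns=["policy_state"], drop_first=True)
print(encoded.columns.tolist())  # → ['fraud_reported', 'policy_state_OH']
```

The same call applied to the full frame would expand every object column into indicator columns at once.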
3. Multivariate Analysis
sns.catplot(data=data, x="fraud_reported", y="total_claim_amount", kind='violin',palette=['blue','green'])
<seaborn.axisgrid.FacetGrid at 0x26d3ee8ba90>
sns.set(style="darkgrid")
fig, axes = plt.subplots(1, 3, figsize=(20, 4), sharey=True)
sns.countplot(x="policy_state", data=data, hue="fraud_reported", ax=axes[0],palette=['blue','green'])
sns.countplot(x="policy_csl", data=data, hue="fraud_reported", ax=axes[1],palette=['blue','green'])
sns.countplot(x="policy_deductable", data=data, hue="fraud_reported", ax=axes[2],palette=['blue','green'])
<AxesSubplot:xlabel='policy_deductable', ylabel='count'>
# Plotting number of customers by fraud report, split by gender and education level
sns.set(style="darkgrid")
fig, axes = plt.subplots(2, 1, figsize=(25, 8), sharey=True)
sns.countplot(x="insured_sex", data=data, hue="fraud_reported", ax=axes[0],palette=['blue','green'])
sns.countplot(x="insured_education_level", data=data, hue="fraud_reported", ax=axes[1],palette=['blue','green'])
<AxesSubplot:xlabel='insured_education_level', ylabel='count'>
#plotting number of customers by fraud report w.r.t. occupation and their hobbies
sns.set(style="darkgrid")
fig, axes = plt.subplots(2, 1, figsize=(25, 8), sharey=True)
sns.countplot(x="insured_occupation", data=data, hue="fraud_reported", ax=axes[0],palette=['blue','green'])
sns.countplot(x="insured_hobbies", data=data, hue="fraud_reported", ax=axes[1],palette=['blue','green'])
<AxesSubplot:xlabel='insured_hobbies', ylabel='count'>
# Plotting number of customers by fraud report based on incident type, collision type, incident severity, state, and city
sns.set(style="darkgrid")
fig, axes = plt.subplots(3, 2, figsize=(25, 12), sharey=True)
sns.countplot(x="incident_type", data=data, hue="fraud_reported", ax=axes[0,0],palette=['blue','green'])
sns.countplot(x="collision_type", data=data, hue="fraud_reported", ax=axes[0,1],palette=['blue','green'])
sns.countplot(x="incident_severity", data=data, hue="fraud_reported", ax=axes[1,0],palette=['blue','green'])
sns.countplot(x="authorities_contacted", data=data, hue="fraud_reported", ax=axes[1,1],palette=['blue','green'])
sns.countplot(x="incident_state", data=data, hue="fraud_reported", ax=axes[2,0],palette=['blue','green'])
sns.countplot(x="incident_city", data=data, hue="fraud_reported", ax=axes[2,1],palette=['blue','green'])
<AxesSubplot:xlabel='incident_city', ylabel='count'>
sns.set(style="darkgrid")
fig, axes = plt.subplots(1, 1, figsize=(25, 8), sharey=True)
sns.countplot(x="incident_hour_of_the_day", data=data, hue="fraud_reported",palette=['blue','green'])
<AxesSubplot:xlabel='incident_hour_of_the_day', ylabel='count'>
# Plotting the features against dependent features
sns.set(style="darkgrid")
fig, axes = plt.subplots(3, 2, figsize=(25, 12), sharey=True)
sns.countplot(x="number_of_vehicles_involved", data=data, hue="fraud_reported", ax=axes[0,0],palette=['blue','green'])
sns.countplot(x="property_damage", data=data, hue="fraud_reported", ax=axes[0,1],palette=['blue','green'])
sns.countplot(x="bodily_injuries", data=data, hue="fraud_reported", ax=axes[1,0],palette=['blue','green'])
sns.countplot(x="witnesses", data=data, hue="fraud_reported", ax=axes[1,1],palette=['blue','green'])
sns.countplot(x="police_report_available", data=data, hue="fraud_reported", ax=axes[2,0],palette=['blue','green'])
sns.countplot(x="auto_make", data=data, hue="fraud_reported", ax=axes[2,1],palette=['blue','green'])
<AxesSubplot:xlabel='auto_make', ylabel='count'>
sns.set(style="darkgrid")
fig, axes = plt.subplots(2, 1, figsize=(25, 15), sharey=True)
sns.countplot(x="auto_model", data=data, hue="fraud_reported", ax=axes[0],palette=['blue','green'])
sns.countplot(x="auto_year_new", data=data, hue="fraud_reported", ax=axes[1],palette=['blue','green'])
<AxesSubplot:xlabel='auto_year_new', ylabel='count'>
# Plotting the features against dependent features
sns.set(style="darkgrid")
fig, axes = plt.subplots(2, 1, figsize=(20, 8), sharey=True)
sns.countplot(x="policy_bind_year", data=data, hue="fraud_reported", ax=axes[0],palette=['blue','green'])
sns.countplot(x="incident_month", data=data, hue="fraud_reported", ax=axes[1],palette=['blue','green'])
<AxesSubplot:xlabel='incident_month', ylabel='count'>
#incident type by severity, split by fraud
sns.catplot(y="incident_type", col="incident_severity", hue="fraud_reported", data=data, palette=['blue','green'], kind="count")
<seaborn.axisgrid.FacetGrid at 0x26d3ea35fd0>
sns.catplot(y="collision_type", col="incident_severity", hue="fraud_reported", data=data, palette=['blue','green'], kind="count")
<seaborn.axisgrid.FacetGrid at 0x26d3f520430>
sns.catplot(y="insured_sex", col="incident_severity", hue="fraud_reported", data=data,palette=['blue','green'], kind="count")
<seaborn.axisgrid.FacetGrid at 0x26d3f049df0>
fig, axes = plt.subplots(2,2, figsize=(25,8))
axes[0][0] = sns.barplot(x="incident_severity", y="injury_claim",
hue="fraud_reported", data=data, ax=axes[0][0],palette=['blue','green']);
axes[0][1] = sns.barplot(x="incident_severity", y="vehicle_claim",
hue="fraud_reported", data=data, ax=axes[0][1],palette=['blue','green']);
axes[1][0] = sns.barplot(x="incident_severity", y="property_claim",
hue="fraud_reported", data=data, ax=axes[1][0],palette=['blue','green']);
axes[1][1] = sns.barplot(x="incident_severity", y="total_claim_amount",
hue="fraud_reported", data=data, ax=axes[1][1],palette=['blue','green']);
#more severe incidents appear only for collision incident types
incident = pd.crosstab(data['incident_type'], data['incident_severity'])
incident.plot(kind='bar',figsize=(10,4),color=['blue','green','red','lightgreen'])
plt.xticks(rotation=45)
plt.title("incident by severity and incident type");
incident = pd.crosstab(data['collision_type'], data['incident_severity'])
incident.plot(kind='bar', figsize=(10,4),color=['blue','green','red','lightgreen'])
plt.xticks(rotation=45)
plt.title("incident by severity and collision type")
Text(0.5, 1.0, 'incident by severity and collision type')
# makes like Dodge, Subaru, Saab, and Mercedes appear in more severe accidents
incident = pd.crosstab(data['auto_make'],data['incident_severity'])
incident.plot(kind='bar',figsize=(25,5),color=['blue','green','red','lightgreen'])
plt.xticks(rotation=45)
plt.title("incident by severity and automake")
Text(0.5, 1.0, 'incident by severity and automake')
# Checking for Extreme Outliers in data
fig, ax = plt.subplots(7, 3, figsize=(20, 10))
i, j = 0, 0
for col in num_index:
    sns.boxplot(x=data[col], ax=ax[i, j], whis=3)
    j += 1
    if j == 3:
        i += 1
        j = 0
fig.tight_layout()
# creating bins for umbrella_limit
quantile_list = [0, .25, .5, .75, 1.]
quantiles = data["umbrella_limit"].quantile(quantile_list)
quantiles
0.00    -1000000.0
0.25           0.0
0.50           0.0
0.75           0.0
1.00    10000000.0
Name: umbrella_limit, dtype: float64
# 0 means no umbrella_limit and 1 means there is umbrella_limit
data["umbrella_limit"] = np.where(data["umbrella_limit"] > 0, 1, 0)
data.head()
| months_as_customer | age | policy_state | policy_csl | policy_deductable | policy_annual_premium | umbrella_limit | insured_sex | insured_education_level | insured_occupation | ... | injury_claim | property_claim | vehicle_claim | auto_make | auto_model | fraud_reported | auto_year_new | incident_month | policy_bind_year | policy_bind_month | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 328 | 48 | OH | 250/500 | 1000 | 1406.91 | 0 | MALE | MD | craft-repair | ... | 6510 | 13020 | 52080 | Saab | 92x | 1 | 2004 | 1 | 2014 | 10 |
| 1 | 228 | 42 | IN | 250/500 | 2000 | 1197.22 | 1 | MALE | MD | machine-op-inspct | ... | 780 | 780 | 3510 | Mercedes | E400 | 1 | 2007 | 1 | 2006 | 6 |
| 2 | 134 | 29 | OH | 100/300 | 2000 | 1413.14 | 1 | FEMALE | PhD | sales | ... | 7700 | 3850 | 23100 | Dodge | RAM | 0 | 2007 | 2 | 2000 | 6 |
| 3 | 256 | 41 | IL | 250/500 | 2000 | 1415.74 | 1 | FEMALE | PhD | armed-forces | ... | 6340 | 6340 | 50720 | Chevrolet | Tahoe | 1 | 2014 | 10 | 1990 | 5 |
| 4 | 228 | 44 | IL | 500/1000 | 1000 | 1583.91 | 1 | MALE | Associate | sales | ... | 1300 | 650 | 4550 | Accura | RSX | 0 | 2009 | 2 | 2014 | 6 |
5 rows × 37 columns
sns.set(style="darkgrid")
fig, axes = plt.subplots(1, 1, figsize=(25, 8), sharey=True)
sns.countplot(x="auto_model", data=data, hue="fraud_reported")
<AxesSubplot:xlabel='auto_model', ylabel='count'>
# contingency table
automodel=pd.crosstab(index=data['auto_model'],columns=data['fraud_reported'])
#shape of the contingency table
automodel.shape
(39, 2)
# get index
automodel.iloc[20].values
array([25, 10], dtype=int64)
from scipy import stats
# pass the whole 39x2 contingency table at once instead of listing every row manually
(chi2, p, dof, _) = stats.chi2_contingency(automodel)
print("chi2 :",chi2)
print("p_value :",p)
print("Degree of freedom :",dof)
chi2 : 46.65817014569841
p_value : 0.15826457876312205
Degree of freedom : 38
# since the p-value (0.158) is above 0.05, we fail to reject the null hypothesis of independence
# auto_model therefore shows no significant association with fraud_reported, so drop it
data=data.drop(['auto_model'],axis=1)
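The same chi-square check can be wrapped in a small helper and reused for any categorical column; a minimal sketch on made-up toy data (the helper name, column names, and the alpha=0.05 cutoff are illustrative, not part of the notebook):

```python
import pandas as pd
from scipy import stats

def chi2_feature_test(df, feature, target, alpha=0.05):
    """Chi-square test of independence between a categorical feature and the target."""
    table = pd.crosstab(df[feature], df[target])
    chi2, p, dof, _ = stats.chi2_contingency(table)
    return chi2, p, p < alpha  # True in the last slot -> dependence, keep the feature

# illustrative toy data: make "B" never has fraud, so the test flags dependence
toy = pd.DataFrame({"make": ["A", "A", "B", "B"] * 25,
                    "fraud": [1, 0, 0, 0] * 25})
chi2, p, keep = chi2_feature_test(toy, "make", "fraud")
```

A column that fails the test (large p-value, like auto_model above) can then be dropped in one place.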
# Encoding
1. Data Encoding
2. Data Scaling
3. SMOTE Analysis
1. Data Encoding
# Label mapping (not one-hot: insured_sex has only two values)
# Binary encoding for insured_sex (MALE=1, FEMALE=0)
sex_map={'MALE':1,'FEMALE':0}
data.insured_sex=data.insured_sex.map(sex_map)
# Mean encoding for nominal features
# replaces each unique value of a column with the mean of fraud_reported for that value
policy_state_map=data.groupby(['policy_state'])['fraud_reported'].mean().to_dict()
# eg. policy_state_map = { 'IL': 0.22781065088757396,'IN': 0.25483870967741934,'OH': 0.2585227272727273 }
data.policy_state=data.policy_state.map(policy_state_map)
policy_csl_map=data.groupby(['policy_csl'])['fraud_reported'].mean().to_dict()
data.policy_csl=data.policy_csl.map(policy_csl_map)
insured_hobby_map=data.groupby(['insured_hobbies'])['fraud_reported'].mean().to_dict()
data.insured_hobbies=data.insured_hobbies.map(insured_hobby_map)
insured_relation_map=data.groupby(['insured_relationship'])['fraud_reported'].mean().to_dict()
data.insured_relationship=data.insured_relationship.map(insured_relation_map)
collision_map=data.groupby(['collision_type'])['fraud_reported'].mean().to_dict()
data.collision_type=data.collision_type.map(collision_map)
incident_state_map=data.groupby(['incident_state'])['fraud_reported'].mean().to_dict()
data.incident_state=data.incident_state.map(incident_state_map)
incident_city_map=data.groupby(["incident_city"])['fraud_reported'].mean().to_dict()
data.incident_city=data.incident_city.map(incident_city_map)
auto_make_map=data.groupby(["auto_make"])['fraud_reported'].mean().to_dict()
data.auto_make=data.auto_make.map(auto_make_map)
property_damage_map=data.groupby(["property_damage"])['fraud_reported'].mean().to_dict()
data.property_damage=data.property_damage.map(property_damage_map)
police_report_available_map=data.groupby(["police_report_available"])['fraud_reported'].mean().to_dict()
data.police_report_available=data.police_report_available.map(police_report_available_map)
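The repeated groupby/map pattern above can be factored into one helper; a minimal sketch on a toy frame (the `mean_encode` name and the toy values are illustrative only):

```python
import pandas as pd

def mean_encode(df, col, target):
    """Map each category to the mean of the target within that category."""
    mapping = df.groupby(col)[target].mean().to_dict()
    return df[col].map(mapping), mapping

# toy frame: OH has one fraud case out of two, IN has none
toy = pd.DataFrame({"state": ["OH", "OH", "IN", "IN"],
                    "fraud": [1, 0, 0, 0]})
encoded, mapping = mean_encode(toy, "state", "fraud")
```

Keeping the returned mapping is also what makes it possible to encode unseen data the same way later.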
# Target Guided Ordinal Encoding
occupation_map={j:i+5 for i,j in enumerate(data.groupby(["insured_occupation"])['fraud_reported'].mean().sort_values().index)}
data.insured_occupation=data.insured_occupation.map(occupation_map)
severity_map={j:i+5 for i,j in enumerate(data.groupby(["incident_severity"])['fraud_reported'].mean().sort_values().index)}
data.incident_severity=data.incident_severity.map(severity_map)
authorities_map={j:i+5 for i,j in enumerate(data.groupby(["authorities_contacted"])['fraud_reported'].mean().sort_values().index)}
data.authorities_contacted=data.authorities_contacted.map(authorities_map)
# Frequency encoding: replace each category with its occurrence count
incident_type_map=data.incident_type.value_counts().to_dict()
data.incident_type=data.incident_type.map(incident_type_map)
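As a quick illustration of what the frequency encoding above does, on a toy Series (values are made up):

```python
import pandas as pd

# each category is replaced by how often it occurs in the column
s = pd.Series(["Collision", "Collision", "Theft", "Parked Car"])
freq_map = s.value_counts().to_dict()   # category -> occurrence count
encoded = s.map(freq_map)
```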
# Ordinal encoding
education_map={'High School':1,'College':2,'Associate':3,'Masters':4,'JD':5,'MD':6,'PhD':7}
data.insured_education_level=data.insured_education_level.map(education_map)
data.isnull().sum()
months_as_customer             0
age                            0
policy_state                   0
policy_csl                     0
policy_deductable              0
policy_annual_premium          0
umbrella_limit                 0
insured_sex                    0
insured_education_level        0
insured_occupation             0
insured_hobbies                0
insured_relationship           0
capital-gains                  0
capital-loss                   0
incident_type                  0
collision_type                 0
incident_severity              0
authorities_contacted          0
incident_state                 0
incident_city                  0
incident_hour_of_the_day       0
number_of_vehicles_involved    0
property_damage                0
bodily_injuries                0
witnesses                      0
police_report_available        0
total_claim_amount             0
injury_claim                   0
property_claim                 0
vehicle_claim                  0
auto_make                      0
fraud_reported                 0
auto_year_new                  0
incident_month                 0
policy_bind_year               0
policy_bind_month              0
dtype: int64
2. Data Scaling
# Applying MinMaxScaler on continuous features
scaler = MinMaxScaler()
data_unscaled = data.copy()
scale_cols = ["months_as_customer", "age", "policy_annual_premium", "injury_claim",
              "property_claim", "vehicle_claim", "total_claim_amount", "policy_bind_year",
              "auto_year_new", "capital-gains", "capital-loss", "incident_hour_of_the_day",
              "number_of_vehicles_involved", "bodily_injuries", "witnesses", "incident_month",
              "policy_bind_month", "incident_type", "umbrella_limit", "incident_severity",
              "authorities_contacted", "insured_education_level", "insured_occupation",
              "policy_deductable"]
for col in scale_cols:
    data[col] = scaler.fit_transform(data[[col]])
data.head()
| months_as_customer | age | policy_state | policy_csl | policy_deductable | policy_annual_premium | umbrella_limit | insured_sex | insured_education_level | insured_occupation | ... | total_claim_amount | injury_claim | property_claim | vehicle_claim | auto_make | fraud_reported | auto_year_new | incident_month | policy_bind_year | policy_bind_month | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.684760 | 0.644444 | 0.258523 | 0.262108 | 0.333333 | 0.603112 | 0.0 | 1 | 0.833333 | 0.846154 | ... | 0.622801 | 0.303497 | 0.550063 | 0.654296 | 0.225000 | 1 | 0.45 | 0.000000 | 0.96 | 0.818182 |
| 1 | 0.475992 | 0.511111 | 0.254839 | 0.262108 | 1.000000 | 0.473214 | 1.0 | 1 | 0.833333 | 0.461538 | ... | 0.043285 | 0.036364 | 0.032953 | 0.043276 | 0.338462 | 1 | 0.60 | 0.000000 | 0.64 | 0.454545 |
| 2 | 0.279749 | 0.222222 | 0.258523 | 0.257880 | 1.000000 | 0.606972 | 1.0 | 0 | 1.000000 | 0.615385 | ... | 0.300906 | 0.358974 | 0.162653 | 0.289722 | 0.250000 | 0 | 0.60 | 0.090909 | 0.40 | 0.454545 |
| 3 | 0.534447 | 0.488889 | 0.227811 | 0.262108 | 1.000000 | 0.608582 | 1.0 | 0 | 1.000000 | 0.538462 | ... | 0.551298 | 0.295571 | 0.267850 | 0.637187 | 0.276316 | 1 | 0.95 | 0.818182 | 0.00 | 0.363636 |
| 4 | 0.475992 | 0.555556 | 0.227811 | 0.216667 | 0.333333 | 0.712760 | 1.0 | 1 | 0.333333 | 0.615385 | ... | 0.055739 | 0.060606 | 0.027461 | 0.056359 | 0.191176 | 0 | 0.70 | 0.090909 | 0.96 | 0.454545 |
5 rows × 36 columns
# Finding Highly Correlated Columns
def correlation(data, threshold):
    """Return the set of columns whose absolute correlation with another column exceeds threshold."""
    col_corr = set()
    cor = data.corr()
    for i in range(len(cor.columns)):
        for j in range(len(cor.columns)):
            if abs(cor.iloc[i, j]) > threshold and i != j:
                if (cor.columns[j] in col_corr) or (cor.columns[i] in col_corr):
                    continue
                print("\n", cor.columns[i], "-----", cor.columns[j])
                print(abs(cor.iloc[i, j]))
                col_corr.add(cor.columns[i])
    return col_corr
a=correlation(data.drop(['fraud_reported'],axis=1),0.8)
print('\n',a)
months_as_customer ----- age
0.9220983225789815
incident_type ----- collision_type
0.9551929841805376
total_claim_amount ----- injury_claim
0.8050253630561779
{'incident_type', 'months_as_customer', 'total_claim_amount'}
data=data.drop(['age','collision_type','injury_claim'],axis=1)
3. SMOTE Analysis
from imblearn.over_sampling import SMOTE
X=data.drop(['fraud_reported'],axis=1)
y=data['fraud_reported']
smote=SMOTE(k_neighbors=12,sampling_strategy='minority')
X_smote,y_smote=smote.fit_resample(X,y)
df=pd.DataFrame(y_smote)
df
| fraud_reported | |
|---|---|
| 0 | 1 |
| 1 | 1 |
| 2 | 0 |
| 3 | 1 |
| 4 | 0 |
| ... | ... |
| 1501 | 1 |
| 1502 | 1 |
| 1503 | 1 |
| 1504 | 1 |
| 1505 | 1 |
1506 rows × 1 columns
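A quick sanity check after SMOTE is that every class now has the same count; a sketch with stand-in labels (in the notebook, `y_smote` would be inspected instead of the made-up list below):

```python
from collections import Counter

# stand-in for the resampled labels: 753 per class, matching the 1506 rows above
labels = [0] * 753 + [1] * 753
counts = Counter(labels)
balanced = len(set(counts.values())) == 1  # True when every class has the same count
```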
sns.set(style="darkgrid")
fig, axes = plt.subplots(2, 1, figsize=(20, 10), sharey=True)
sns.countplot(x="fraud_reported", hue="fraud_reported", data=data, ax=axes[0], palette=['blue','green'])
sns.countplot(x="fraud_reported", data=df, ax=axes[1], palette=['blue','green'])
plt.title("After smote")
Text(0.5, 1.0, 'After smote')
data=pd.merge(X_smote,y_smote,left_index=True,right_index=True)
data
| months_as_customer | policy_state | policy_csl | policy_deductable | policy_annual_premium | umbrella_limit | insured_sex | insured_education_level | insured_occupation | insured_hobbies | ... | police_report_available | total_claim_amount | property_claim | vehicle_claim | auto_make | auto_year_new | incident_month | policy_bind_year | policy_bind_month | fraud_reported | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.684760 | 0.258523 | 0.262108 | 0.333333 | 0.603112 | 0.0 | 1 | 0.833333 | 0.846154 | 0.195122 | ... | 0.229299 | 0.622801 | 0.550063 | 0.654296 | 0.225000 | 0.450000 | 0.000000 | 0.960000 | 0.818182 | 1 |
| 1 | 0.475992 | 0.254839 | 0.262108 | 1.000000 | 0.473214 | 1.0 | 1 | 0.833333 | 0.461538 | 0.265625 | ... | 0.259475 | 0.043285 | 0.032953 | 0.043276 | 0.338462 | 0.600000 | 0.000000 | 0.640000 | 0.454545 | 1 |
| 2 | 0.279749 | 0.258523 | 0.257880 | 1.000000 | 0.606972 | 1.0 | 0 | 1.000000 | 0.615385 | 0.291667 | ... | 0.250729 | 0.300906 | 0.162653 | 0.289722 | 0.250000 | 0.600000 | 0.090909 | 0.400000 | 0.454545 | 0 |
| 3 | 0.534447 | 0.227811 | 0.262108 | 1.000000 | 0.608582 | 1.0 | 0 | 1.000000 | 0.538462 | 0.291667 | ... | 0.250729 | 0.551298 | 0.267850 | 0.637187 | 0.276316 | 0.950000 | 0.818182 | 0.000000 | 0.363636 | 1 |
| 4 | 0.475992 | 0.227811 | 0.216667 | 0.333333 | 0.712760 | 1.0 | 1 | 0.333333 | 0.615385 | 0.291667 | ... | 0.250729 | 0.055739 | 0.027461 | 0.056359 | 0.191176 | 0.700000 | 0.090909 | 0.960000 | 0.454545 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1501 | 0.725728 | 0.244197 | 0.257880 | 0.000000 | 0.372629 | 0.0 | 0 | 0.488816 | 0.820597 | 0.742857 | ... | 0.255396 | 0.723901 | 0.472128 | 0.778104 | 0.243965 | 0.629868 | 0.048505 | 0.362684 | 0.485048 | 1 |
| 1502 | 0.346772 | 0.254174 | 0.257880 | 1.000000 | 0.636392 | 0.0 | 0 | 0.187150 | 0.842372 | 0.224644 | ... | 0.250944 | 0.390618 | 0.174333 | 0.410720 | 0.335980 | 0.461061 | 0.095378 | 0.678034 | 0.536516 | 1 |
| 1503 | 0.284399 | 0.258523 | 0.260931 | 1.000000 | 0.602277 | 0.0 | 0 | 0.953604 | 0.957173 | 0.116279 | ... | 0.235265 | 0.431274 | 0.268437 | 0.479333 | 0.265605 | 0.000000 | 0.393613 | 0.713405 | 0.524818 | 1 |
| 1504 | 0.286135 | 0.254839 | 0.227275 | 0.077819 | 0.592089 | 0.0 | 0 | 0.666667 | 0.702488 | 0.250225 | ... | 0.252430 | 0.699635 | 0.322221 | 0.819074 | 0.213692 | 0.401748 | 0.718080 | 0.741323 | 0.887867 | 1 |
| 1505 | 0.719633 | 0.257092 | 0.244455 | 0.333333 | 0.633328 | 0.0 | 0 | 0.639098 | 0.786389 | 0.161701 | ... | 0.229299 | 0.635624 | 0.583872 | 0.657717 | 0.234712 | 0.605389 | 0.000000 | 0.820150 | 0.818182 | 1 |
1506 rows × 33 columns
1. Logistic Regression
2. RandomForestClassifier
3. XGBoostClassifier
# separate the independent variables and the target variable
X=data.loc[:,"months_as_customer":"policy_bind_month"]
y=data.fraud_reported
# split the data; test_size=0.3 holds out 30% for testing
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=40)
print('X_train:',X_train.shape)
print('y_train:',y_train.shape)
print('X_test:',X_test.shape)
print('y_test:',y_test.shape)
X_train: (1054, 32)
y_train: (1054,)
X_test: (452, 32)
y_test: (452,)
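The split above is unstratified, which is acceptable here because SMOTE has already balanced the classes; for imbalanced labels, `stratify=y` keeps the class ratio equal in both halves. A sketch on toy data (`X_toy`/`y_toy` are illustrative, not notebook variables):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy labels with an 80/20 imbalance
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3,
                                          random_state=40, stratify=y_toy)
# stratify preserves the 20% positive rate in both the train and test halves
```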
# model_report: finds the optimal classification threshold for the LogisticRegression, RandomForestClassifier, and XGBoost models
def model_report(model_name, model, X_train, y_train, X_test, y_test):
    print('\nSearch for OPTIMAL THRESHOLD, vary from 0.0001 to 0.9999, fit/predict on train/test data')
    model.fit(X_train, y_train)
    optimal_th = 0.5  # start with the default threshold value
    for i in range(0, 3):
        score_list = []
        print('\nLooping decimal place', i + 1)
        th_list = [np.linspace(optimal_th - 0.4999, optimal_th + 0.4999, 11),
                   np.linspace(optimal_th - 0.1, optimal_th + 0.1, 21),
                   np.linspace(optimal_th - 0.01, optimal_th + 0.01, 21)]
        for th in th_list[i]:
            y_pred = (model.predict_proba(X_test)[:, 1] >= th)
            f1scor = f1_score(y_test, y_pred)
            score_list.append(f1scor)
            print('{:.3f}->{:.4f}'.format(th, f1scor), end=', ')  # display score to 4 decimal places
        optimal_th = float(th_list[i][score_list.index(max(score_list))])
    print('optimal F1 score = {:.4f}'.format(max(score_list)))
    print('optimal threshold = {:.3f}'.format(optimal_th))
    print(model_name, 'accuracy score is')
    print('Training: {:.2f}%'.format(100 * model.score(X_train, y_train)))  # score uses accuracy
    print('Test set: {:.2f}%'.format(100 * model.score(X_test, y_test)))  # should use cross-validation
    y_pred = (model.predict_proba(X_test)[:, 1] >= 0.25)
    print('\nAdjust threshold to 0.25:')
    print('Precision: {:.4f}, Recall: {:.4f}, F1 Score: {:.4f}'.format(
        precision_score(y_test, y_pred), recall_score(y_test, y_pred), f1_score(y_test, y_pred)))
    print(model_name, 'confusion matrix: \n', confusion_matrix(y_test, y_pred))
    y_pred = model.predict(X_test)
    print('\nDefault threshold of 0.50:')
    print('Precision: {:.4f}, Recall: {:.4f}, F1 Score: {:.4f}'.format(
        precision_score(y_test, y_pred), recall_score(y_test, y_pred), f1_score(y_test, y_pred)))
    print(model_name, 'confusion matrix: \n', confusion_matrix(y_test, y_pred))
    y_pred = (model.predict_proba(X_test)[:, 1] >= 0.75)
    print('\nAdjust threshold to 0.75:')
    print('Precision: {:.4f}, Recall: {:.4f}, F1 Score: {:.4f}'.format(
        precision_score(y_test, y_pred), recall_score(y_test, y_pred), f1_score(y_test, y_pred)))
    print(model_name, 'confusion matrix: \n', confusion_matrix(y_test, y_pred))
    y_pred = (model.predict_proba(X_test)[:, 1] >= optimal_th)
    print('\nOptimal threshold {:.3f}'.format(optimal_th))
    print('Precision: {:.4f}, Recall: {:.4f}, F1 Score: {:.4f}'.format(
        precision_score(y_test, y_pred), recall_score(y_test, y_pred), f1_score(y_test, y_pred)))
    print(model_name, 'confusion matrix: \n', confusion_matrix(y_test, y_pred))
    global model_f1, model_auc, model_ll, model_roc_auc
    model_f1 = f1_score(y_test, y_pred)
    y_pred = model.predict_proba(X_test)
    model_ll = log_loss(y_test, y_pred)
    print(model_name, 'Log-loss: {:.4f}'.format(model_ll))
    y_pred = model.predict(X_test)
    model_roc_auc = roc_auc_score(y_test, y_pred)
    print(model_name, 'roc_auc_score: {:.4f}'.format(model_roc_auc))
    y_pred = model.predict_proba(X_test)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_test, y_pred)
    model_auc = auc(fpr, tpr)
    print(model_name, 'AUC: {:.4f}'.format(model_auc))
    # plot the ROC curve
    plt.figure(figsize=[6, 6])
    plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % model_auc)
    plt.plot([0, 1], [0, 1], 'r--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.show()
# initialise lists to collect the results to plot later
model_list = []
f1_list = []
auc_list = []
ll_list = []
roc_auc_list = []
time_list = []
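The single train/test accuracy reported by model_report can be complemented with k-fold cross-validation, as the in-code comment suggests; a minimal sketch on synthetic data (the toy problem below is illustrative, not the insurance data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# synthetic, roughly linearly separable toy problem
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

# 5-fold stratified CV gives five held-out F1 scores instead of one test score
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(), X_toy, y_toy, cv=cv, scoring="f1")
```

The mean and spread of `scores` give a less variance-prone estimate than a single 70/30 split.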
1. Logistic Regression
print('\n"""""" Logistic regression """"""')
logRe_final = LogisticRegression()
model_report('LogisticRegression', logRe_final, X_train, y_train,X_test,y_test)
model_list.append('LogisticRegression')
f1_list.append(model_f1)
auc_list.append(model_auc)
ll_list.append(model_ll)
roc_auc_list.append(model_roc_auc)
"""""" Logistic regression """""" Search for OPTIMAL THRESHOLD, vary from 0.0001 to 0.9999, fit/predict on train/test data Looping decimal place 1 0.000->0.6706, 0.100->0.7567, 0.200->0.7864, 0.300->0.8184, 0.400->0.8370, 0.500->0.8518, 0.600->0.8596, 0.700->0.7837, 0.800->0.6057, 0.900->0.2667, 1.000->0.0000, Looping decimal place 2 0.500->0.8518, 0.510->0.8553, 0.520->0.8571, 0.530->0.8571, 0.540->0.8608, 0.550->0.8608, 0.560->0.8644, 0.570->0.8681, 0.580->0.8632, 0.590->0.8627, 0.600->0.8596, 0.610->0.8584, 0.620->0.8553, 0.630->0.8496, 0.640->0.8463, 0.650->0.8352, 0.660->0.8318, 0.670->0.8265, 0.680->0.8074, 0.690->0.8038, 0.700->0.7837, Looping decimal place 3 0.560->0.8644, 0.561->0.8662, 0.562->0.8681, 0.563->0.8681, 0.564->0.8681, 0.565->0.8681, 0.566->0.8681, 0.567->0.8681, 0.568->0.8681, 0.569->0.8681, 0.570->0.8681, 0.571->0.8681, 0.572->0.8681, 0.573->0.8681, 0.574->0.8681, 0.575->0.8657, 0.576->0.8657, 0.577->0.8657, 0.578->0.8657, 0.579->0.8657, 0.580->0.8632, optimal F1 score = 0.8681 optimal threshold = 0.562 LogisticRegression accuracy score is Training: 86.24% Test set: 84.29% Adjust threshold to 0.25: Precision: 0.6857, Recall: 0.9474, F1 Score: 0.7956 LogisticRegression confusion matrix: [[125 99] [ 12 216]] Default threshold of 0.50: Precision: 0.8127, Recall: 0.8947, F1 Score: 0.8518 LogisticRegression confusion matrix: [[177 47] [ 24 204]] Adjust threshold to 0.75: Precision: 0.8696, Recall: 0.6140, F1 Score: 0.7198 LogisticRegression confusion matrix: [[203 21] [ 88 140]] Optimal threshold 0.562 Precision: 0.8430, Recall: 0.8947, F1 Score: 0.8681 LogisticRegression confusion matrix: [[186 38] [ 24 204]] LogisticRegression Log-loss: 0.4232 LogisticRegression roc_auc_score: 0.8425 LogisticRegression AUC: 0.8918
# classification report for logistic regression model
y_pred = logRe_final.predict(X_test)
print(classification_report(y_test, y_pred))
#Get the confusion matrix
cf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(cf_matrix, annot=True, fmt='g')
print("Accuracy Score : ", accuracy_score(y_test, y_pred))
precision recall f1-score support
0 0.88 0.79 0.83 224
1 0.81 0.89 0.85 228
accuracy 0.84 452
macro avg 0.85 0.84 0.84 452
weighted avg 0.85 0.84 0.84 452
Accuracy Score : 0.8429203539823009
2. RandomForestClassifier
# before parameter tuning
#RandomForestClassifier
print('\n"""""" RandomForestClassifier """"""')
BP_Rfc_model = RandomForestClassifier()
model_report('RandomForestClassifier',BP_Rfc_model, X_train, y_train ,X_test ,y_test)
model_list.append('RandomForestClassifier')
f1_list.append(model_f1)
auc_list.append(model_auc)
ll_list.append(model_ll)
roc_auc_list.append(model_roc_auc)
"""""" RandomForestClassifier """""" Search for OPTIMAL THRESHOLD, vary from 0.0001 to 0.9999, fit/predict on train/test data Looping decimal place 1 0.000->0.6706, 0.100->0.7138, 0.200->0.8453, 0.300->0.8974, 0.400->0.9050, 0.500->0.9009, 0.600->0.8879, 0.700->0.8416, 0.800->0.7452, 0.900->0.4330, 1.000->0.0000, Looping decimal place 2 0.300->0.8974, 0.310->0.9010, 0.320->0.9047, 0.330->0.9065, 0.340->0.9043, 0.350->0.9061, 0.360->0.9080, 0.370->0.9098, 0.380->0.9095, 0.390->0.9095, 0.400->0.9050, 0.410->0.9064, 0.420->0.9102, 0.430->0.9102, 0.440->0.9099, 0.450->0.9091, 0.460->0.9064, 0.470->0.9013, 0.480->0.8989, 0.490->0.9009, 0.500->0.9009, Looping decimal place 3 0.410->0.9064, 0.411->0.9064, 0.412->0.9064, 0.413->0.9064, 0.414->0.9064, 0.415->0.9064, 0.416->0.9064, 0.417->0.9064, 0.418->0.9064, 0.419->0.9064, 0.420->0.9102, 0.421->0.9102, 0.422->0.9102, 0.423->0.9102, 0.424->0.9102, 0.425->0.9102, 0.426->0.9102, 0.427->0.9102, 0.428->0.9102, 0.429->0.9102, 0.430->0.9102, optimal F1 score = 0.9102 optimal threshold = 0.420 RandomForestClassifier accuracy score is Training: 100.00% Test set: 89.82% Adjust threshold to 0.25: Precision: 0.7887, Recall: 0.9825, F1 Score: 0.8750 RandomForestClassifier confusion matrix: [[164 60] [ 4 224]] Default threshold of 0.50: Precision: 0.8856, Recall: 0.9167, F1 Score: 0.9009 RandomForestClassifier confusion matrix: [[197 27] [ 19 209]] Adjust threshold to 0.75: Precision: 0.9748, Recall: 0.6798, F1 Score: 0.8010 RandomForestClassifier confusion matrix: [[220 4] [ 73 155]] Optimal threshold 0.420 Precision: 0.8685, Recall: 0.9561, F1 Score: 0.9102 RandomForestClassifier confusion matrix: [[191 33] [ 10 218]] RandomForestClassifier Log-loss: 0.2940 RandomForestClassifier roc_auc_score: 0.8981 RandomForestClassifier AUC: 0.9663
Classification report for RandomForestClassifier model
# classification report for RandomForestClassifier model before parameter tuning
y_pred = BP_Rfc_model.predict(X_test)
print(classification_report(y_test, y_pred))
#Get the confusion matrix
cf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(cf_matrix, annot=True, fmt='g')
print("Accuracy Score : ", accuracy_score(y_test, y_pred))
precision recall f1-score support
0 0.91 0.88 0.90 224
1 0.89 0.92 0.90 228
accuracy 0.90 452
macro avg 0.90 0.90 0.90 452
weighted avg 0.90 0.90 0.90 452
Accuracy Score : 0.8982300884955752
# After parameter tuning RandomForestClassifier
print('\n"""""" RandomForestClassifier """"""')
AF_Rfc_model= RandomForestClassifier(n_estimators=150,max_depth=6, criterion='entropy',random_state=30)
model_report('RandomForestClassifier',AF_Rfc_model, X_train, y_train ,X_test ,y_test)
model_list.append('RandomForestClassifier')
f1_list.append(model_f1)
auc_list.append(model_auc)
ll_list.append(model_ll)
roc_auc_list.append(model_roc_auc)
"""""" RandomForestClassifier """""" Search for OPTIMAL THRESHOLD, vary from 0.0001 to 0.9999, fit/predict on train/test data Looping decimal place 1 0.000->0.6706, 0.100->0.6868, 0.200->0.7559, 0.300->0.8733, 0.400->0.8948, 0.500->0.8837, 0.600->0.8591, 0.700->0.7760, 0.800->0.4984, 0.900->0.0840, 1.000->0.0000, Looping decimal place 2 0.300->0.8733, 0.310->0.8802, 0.320->0.8845, 0.330->0.8862, 0.340->0.8889, 0.350->0.8907, 0.360->0.8925, 0.370->0.8980, 0.380->0.8957, 0.390->0.8934, 0.400->0.8948, 0.410->0.8875, 0.420->0.8875, 0.430->0.8875, 0.440->0.8875, 0.450->0.8875, 0.460->0.8805, 0.470->0.8805, 0.480->0.8824, 0.490->0.8837, 0.500->0.8837, Looping decimal place 3 0.360->0.8925, 0.361->0.8925, 0.362->0.8925, 0.363->0.8943, 0.364->0.8961, 0.365->0.8961, 0.366->0.8961, 0.367->0.8980, 0.368->0.8980, 0.369->0.8980, 0.370->0.8980, 0.371->0.8980, 0.372->0.8980, 0.373->0.8980, 0.374->0.8980, 0.375->0.8980, 0.376->0.8980, 0.377->0.8980, 0.378->0.8957, 0.379->0.8957, 0.380->0.8957, optimal F1 score = 0.8980 optimal threshold = 0.367 RandomForestClassifier accuracy score is Training: 93.36% Test set: 87.83% Adjust threshold to 0.25: Precision: 0.7040, Recall: 0.9912, F1 Score: 0.8233 RandomForestClassifier confusion matrix: [[129 95] [ 2 226]] Default threshold of 0.50: Precision: 0.8531, Recall: 0.9167, F1 Score: 0.8837 RandomForestClassifier confusion matrix: [[188 36] [ 19 209]] Adjust threshold to 0.75: Precision: 0.9664, Recall: 0.5044, F1 Score: 0.6628 RandomForestClassifier confusion matrix: [[220 4] [113 115]] Optimal threshold 0.367 Precision: 0.8397, Recall: 0.9649, F1 Score: 0.8980 RandomForestClassifier confusion matrix: [[182 42] [ 8 220]] RandomForestClassifier Log-loss: 0.3675 RandomForestClassifier roc_auc_score: 0.8780 RandomForestClassifier AUC: 0.9493
# classification report for RandomForestClassifier after parameter tuning
y_pred = AF_Rfc_model.predict(X_test)
print(classification_report(y_test, y_pred))
#Get the confusion matrix
cf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(cf_matrix, annot=True, fmt='g')
print("Accuracy Score : ", accuracy_score(y_test, y_pred))
precision recall f1-score support
0 0.91 0.84 0.87 224
1 0.85 0.92 0.88 228
accuracy 0.88 452
macro avg 0.88 0.88 0.88 452
weighted avg 0.88 0.88 0.88 452
Accuracy Score : 0.8783185840707964
3. XGBoostClassifier Before Parameter Tuning
# Before parameter tuning XGBoostClassifier
print('\n"""""" XGBoostClassifier """"""')
BF_xgb_model = XGBClassifier(colsample_bytree=0.7, gamma=0.1, learning_rate=0.05, max_depth=10, min_child_weight=1)
model_report('XgboostClassifier', BF_xgb_model, X_train, y_train, X_test, y_test)
model_list.append('XGBoostClassifier')
f1_list.append(model_f1)
auc_list.append(model_auc)
ll_list.append(model_ll)
roc_auc_list.append(model_roc_auc)
"""""" XGBoostClassifier """""" Search for OPTIMAL THRESHOLD, vary from 0.0001 to 0.9999, fit/predict on train/test data [19:04:18] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.5.1/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior. Looping decimal place 1 0.000->0.6706, 0.100->0.8654, 0.200->0.8953, 0.300->0.8958, 0.400->0.9015, 0.500->0.9017, 0.600->0.9134, 0.700->0.9018, 0.800->0.8645, 0.900->0.7876, 1.000->0.0000, Looping decimal place 2 0.500->0.9017, 0.510->0.9017, 0.520->0.9036, 0.530->0.9036, 0.540->0.9036, 0.550->0.9036, 0.560->0.9036, 0.570->0.9056, 0.580->0.9075, 0.590->0.9095, 0.600->0.9134, 0.610->0.9154, 0.620->0.9174, 0.630->0.9170, 0.640->0.9170, 0.650->0.9147, 0.660->0.9099, 0.670->0.9095, 0.680->0.9047, 0.690->0.9022, 0.700->0.9018, Looping decimal place 3 0.610->0.9154, 0.611->0.9174, 0.612->0.9174, 0.613->0.9174, 0.614->0.9174, 0.615->0.9174, 0.616->0.9174, 0.617->0.9174, 0.618->0.9174, 0.619->0.9174, 0.620->0.9174, 0.621->0.9174, 0.622->0.9174, 0.623->0.9150, 0.624->0.9150, 0.625->0.9170, 0.626->0.9170, 0.627->0.9170, 0.628->0.9170, 0.629->0.9170, 0.630->0.9170, optimal F1 score = 0.9174 optimal threshold = 0.611 XgboostClassifier accuracy score is Training: 100.00% Test set: 89.82% Adjust threshold to 0.25: Precision: 0.8438, Recall: 0.9474, F1 Score: 0.8926 XgboostClassifier confusion matrix: [[184 40] [ 12 216]] Default threshold of 0.50: Precision: 0.8792, Recall: 0.9254, F1 Score: 0.9017 XgboostClassifier confusion matrix: [[195 29] [ 17 211]] Adjust threshold to 0.75: Precision: 0.9206, Recall: 0.8640, F1 Score: 0.8914 XgboostClassifier confusion matrix: [[207 17] [ 31 197]] Optimal threshold 0.611 Precision: 0.9095, Recall: 0.9254, F1 Score: 0.9174 XgboostClassifier confusion matrix: [[203 21] [ 17 211]] XgboostClassifier Log-loss: 0.2554 
XgboostClassifier roc_auc_score: 0.8980 XgboostClassifier AUC: 0.9602
# classification report for XGBoostClassifier before parameter tuning
y_pred = BF_xgb_model.predict(X_test)
print(classification_report(y_test, y_pred))
#Get the confusion matrix
cf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(cf_matrix, annot=True, fmt='g')
print("Accuracy Score : ", accuracy_score(y_test, y_pred))
precision recall f1-score support
0 0.92 0.87 0.89 224
1 0.88 0.93 0.90 228
accuracy 0.90 452
macro avg 0.90 0.90 0.90 452
weighted avg 0.90 0.90 0.90 452
Accuracy Score : 0.8982300884955752
# After parameter tuning XGBoostClassifier
print('\n"""""" XGBoostClassifier """"""')
AF_xgb_model= XGBClassifier( gamma=0.01, learning_rate=0.0005, max_depth=15, min_child_weight=1)
model_report('XgboostClassifier', AF_xgb_model, X_train, y_train, X_test, y_test)
model_list.append('XGBoostClassifier')
f1_list.append(model_f1)
auc_list.append(model_auc)
ll_list.append(model_ll)
roc_auc_list.append(model_roc_auc)
"""""" XGBoostClassifier """""" Search for OPTIMAL THRESHOLD, vary from 0.0001 to 0.9999, fit/predict on train/test data [19:04:19] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.5.1/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior. Looping decimal place 1 0.000->0.6706, 0.100->0.6706, 0.200->0.6706, 0.300->0.6706, 0.400->0.6706, 0.500->0.8653, 0.600->0.0000, 0.700->0.0000, 0.800->0.0000, 0.900->0.0000, 1.000->0.0000, Looping decimal place 2 0.400->0.6706, 0.410->0.6706, 0.420->0.6706, 0.430->0.6706, 0.440->0.6706, 0.450->0.6706, 0.460->0.6706, 0.470->0.6706, 0.480->0.8527, 0.490->0.8694, 0.500->0.8653, 0.510->0.8411, 0.520->0.8193, 0.530->0.0000, 0.540->0.0000, 0.550->0.0000, 0.560->0.0000, 0.570->0.0000, 0.580->0.0000, 0.590->0.0000, 0.600->0.0000, Looping decimal place 3 0.480->0.8527, 0.481->0.8527, 0.482->0.8520, 0.483->0.8537, 0.484->0.8571, 0.485->0.8641, 0.486->0.8641, 0.487->0.8641, 0.488->0.8641, 0.489->0.8676, 0.490->0.8694, 0.491->0.8706, 0.492->0.8724, 0.493->0.8760, 0.494->0.8773, 0.495->0.8768, 0.496->0.8747, 0.497->0.8723, 0.498->0.8769, 0.499->0.8745, 0.500->0.8653, optimal F1 score = 0.8773 optimal threshold = 0.494 XgboostClassifier accuracy score is Training: 94.50% Test set: 86.73% Adjust threshold to 0.25: Precision: 0.5044, Recall: 1.0000, F1 Score: 0.6706 XgboostClassifier confusion matrix: [[ 0 224] [ 0 228]] Default threshold of 0.50: Precision: 0.8784, Recall: 0.8553, F1 Score: 0.8667 XgboostClassifier confusion matrix: [[197 27] [ 33 195]] Adjust threshold to 0.75: Precision: 0.0000, Recall: 0.0000, F1 Score: 0.0000 XgboostClassifier confusion matrix: [[224 0] [228 0]] Optimal threshold 0.494 Precision: 0.8340, Recall: 0.9254, F1 Score: 0.8773 XgboostClassifier confusion matrix: [[182 42] [ 17 211]] XgboostClassifier Log-loss: 0.6623 
XgboostClassifier roc_auc_score: 0.8674 XgboostClassifier AUC: 0.9072
y_pred = AF_xgb_model.predict(X_test)
print(classification_report(y_test, y_pred))
#Get the confusion matrix
cf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(cf_matrix, annot=True, fmt='g')
print("Accuracy Score : ", accuracy_score(y_test, y_pred))
precision recall f1-score support
0 0.86 0.88 0.87 224
1 0.88 0.86 0.87 228
accuracy 0.87 452
macro avg 0.87 0.87 0.87 452
weighted avg 0.87 0.87 0.87 452
Accuracy Score : 0.8672566371681416
- https://seaborn.pydata.org/tutorial/categorical.html [visualization]
- https://plotly.com/python/v4-migration/ [interactive visualization]
- Chi-square test for feature selection [ALEngineering channel]
- https://towardsdatascience.com/handling-imbalanced-datasets-in-machine-learning-7a0e84220f28 [handling imbalanced data]
- https://www.analyticsvidhya.com/blog/2017/03/imbalanced-data-classification/ [handling imbalanced data]
- https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/ [handling imbalanced data]
- https://www.youtube.com/watch?v=fxw_Ak4t-LY [types of encoding]
- https://www.coursera.org/lecture/competitive-data-science/concept-of-mean-encoding-b5Gxv [mean encoding concept]
- https://machinelearningmastery.com/threshold-moving-for-imbalanced-classification/ [threshold moving]
- https://www.youtube.com/watch?v=_AjhdXuXEDE [finding the optimal classification threshold]
- https://numpy.org/doc/stable/reference/generated/numpy.linspace.html [numpy.linspace]
- https://www.kaggle.com/nirajvermafcb/comparing-various-ml-models-roc-curve-comparison [model evaluation metrics]
- Robust logistic regression for insurance risk classification (repec.org)
- Vehicle insurance - Random forest classifier | Aviral Bhardwaj | Medium
- Accuracy vs. F1-Score | Purva Huilgol | Analytics Vidhya | Medium